233 research outputs found

Efficient Logging in Non-Volatile Memory by Exploiting Coherency Protocols

Author: Cohen Nachshon
Friedman Michal
Larus James R.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 08/09/2017

Non-volatile memory (NVM) technologies such as PCM, ReRAM and STT-RAM allow processors to directly write values to persistent storage at speeds that are significantly faster than previous durable media such as hard drives or SSDs. Many applications of NVM are constructed on a logging subsystem, which enables operations to appear to execute atomically and facilitates recovery from failures. Writes to NVM, however, pass through a processor's memory system, which can delay and reorder them and can impair the correctness and cost of logging algorithms. Reordering arises because of out-of-order execution in a CPU and the inter-processor cache coherence protocol. By carefully considering the properties of these reorderings, this paper develops a logging protocol that requires only one round trip to non-volatile memory while avoiding expensive computations. We show how to extend the logging protocol to build a persistent set (hash map) that also requires only a single round trip to non-volatile memory for insertion, updating, or deletion.

arXiv.org e-Print Archive

Infoscience - École polytechnique fédérale de Lausanne

Fine-Grain Checkpointing with In-Cache-Line Logging

Author: Aksun David T.
Avni Hillel
Cohen Nachshon
Larus James R.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 02/02/2019

Non-Volatile Memory offers the possibility of implementing high-performance, durable data structures. However, achieving performance comparable to well-designed data structures in non-persistent (transient) memory is difficult, primarily because of the cost of ensuring the order in which memory writes reach NVM. Often, this requires flushing data to NVM and waiting a full memory round-trip time. In this paper, we introduce two new techniques: Fine-Grained Checkpointing, which ensures a consistent, quickly recoverable data structure in NVM after a system failure, and In-Cache-Line Logging, an undo-logging technique that enables recovery of earlier state without requiring cache-line flushes in the normal case. We implemented these techniques in the Masstree data structure, making it persistent and demonstrating the ease of applying them to a highly optimized system and their low (5.9-15.4%) runtime overhead cost. Comment: In 2019 Architectural Support for Programming Languages and Operating Systems (ASPLOS 19), April 13, 2019, Providence, RI, US

arXiv.org e-Print Archive

Crossref

Manticore: Hardware-Accelerated RTL Simulation with Static Bulk-Synchronous Parallelism

Author: Emami Mahyar
Kamahori Keisuke
Kashani Sahand
Larus James R.
Pourghannad Mohammad Sepehr
Raj Ritik
Publication date: 23/01/2023

The demise of Moore's Law and Dennard Scaling has revived interest in specialized computer architectures and accelerators. Verification and testing of this hardware heavily uses cycle-accurate simulation of register-transfer-level (RTL) designs. The best software RTL simulators can simulate designs at 1-1000 kHz, i.e., more than three orders of magnitude slower than hardware. Faster simulation can increase productivity by speeding design iterations and permitting more exhaustive exploration. One possibility is to use parallelism as RTL exposes considerable fine-grain concurrency. However, state-of-the-art RTL simulators generally perform best when single-threaded since modern processors cannot effectively exploit fine-grain parallelism. This work presents Manticore: a parallel computer designed to accelerate RTL simulation. Manticore uses a static bulk-synchronous parallel (BSP) execution model to eliminate runtime synchronization barriers among many simple processors. Manticore relies entirely on its compiler to schedule resources and communication. Because RTL code is practically free of long divergent execution paths, static scheduling is feasible. Communication and synchronization no longer incur runtime overhead, enabling efficient fine-grain parallelism. Moreover, static scheduling dramatically simplifies the physical implementation, significantly increasing the potential parallelism on a chip. Our 225-core FPGA prototype running at 475 MHz outperforms a state-of-the-art RTL simulator on an Intel Xeon processor running at ≈ 3.3 GHz by up to 27.9× (geomean 5.3×) in nine Verilog benchmarks.

arXiv.org e-Print Archive

Sirocco: cost-effective fine-grain distributed shared memory

Author: Falsafi Babak
Hill Mark D.
Larus James R.
Schoinas Ioannis
Wood David A.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 06/04/2009

Software fine-grain distributed shared memory (FGDSM) provides a simplified shared-memory programming interface with minimal or no hardware support. Originally software FGDSMs targeted uniprocessor-node parallel machines. This paper presents Sirocco, a family of software FGDSMs implemented on a network of low-cost SMPs. Sirocco takes full advantage of SMP nodes by implementing inter-node sharing directly in hardware and overlapping computation with protocol execution. To maintain correct shared-memory semantics, however, SMP nodes require mechanisms to guarantee atomic coherence operations. Multiple SMP processors may also result in contention for shared resources and reduce performance. SMP nodes also impact the cost trade-off. While SMPs typically charge higher price-premiums, for a given system size SMP nodes substantially reduce networking hardware requirement as compared to uniprocessor nodes. In this paper, we ask the question “Are SMPs cost-effective building blocks for software FGDSM?” We present experimental measurements on Sirocco implementations ranging from an all-software system to a system with minimal hardware support. Together with simple cost models we show that low-cost SMP nodes: (i) result in competitive performance with uniprocessor nodes, (ii) substantially reduce hardware requirement and are more cost-effective than uniprocessor nodes, (iii) significantly benefit from hardware support for coherence operations, and (iv) are especially beneficial for FGDSMs with high-overhead coherence operations.

Infoscience - École polytechnique fédérale de Lausanne

Typing Copyless Message Passing

Author: Bruno Courcelle
Dario Colazzo and Giorgio Ghelli
Frank Piessens
Galen C. Hunt and James R. Larus
Luca Cardelli Simone Martini, John C. M
Luca Padovani
Simon Gay
Simon Gay and Malcolm Hole
Simon Gay and Vasco T. Vasconcelos
Viviana Bono
Publication venue: 'Logical Methods in Computer Science e.V.'
Publication date: 01/01/2011

We present a calculus that models a form of process interaction based on copyless message passing, in the style of Singularity OS. The calculus is equipped with a type system ensuring that well-typed processes are free from memory faults, memory leaks, and communication errors. The type system is essentially linear, but we show that linearity alone is inadequate, because it leaves room for scenarios where well-typed processes leak significant amounts of memory. We address these problems by basing the type system upon an original variant of session types. Comment: 50 pages

arXiv.org e-Print Archive

CiteSeerX

Crossref

Episciences.org

Archivio istituzionale della ricerca - Università di Camerino

Institutional Research Information System University of Turin

Mechanisms for cooperative shared memory

Author: Chandra Satish
Falsafi Babak
Hill Mark D.
Larus James R.
Lebeck Alvin R.
Lewis James C.
Mukherjee Shubhendu S.
Palacharla Subbarao
Reinhardt Steven K.
Wood David A.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 06/04/2009

This paper explores the complexity of implementing directory protocols by examining their mechanisms - primitive operations on directories, caches, and network interfaces. We compare the following protocols: Dir1B, Dir4B, Dir4NB, DirnNB, Dir1SW and an improved version of Dir1SW (Dir1SW+). The comparison shows that the mechanisms and mechanism sequencing of Dir1SW and Dir1SW+ are simpler than those for other protocols. We also compare protocol performance by running eight benchmarks on 32-processor systems. Simulations show that Dir1SW+'s performance is comparable to more complex directory protocols. The significant disparity in hardware complexity and the small difference in performance argue that Dir1SW+ may be a more effective use of resources. The small performance difference is attributable to two factors: the low degree of sharing in the benchmarks and Check-In/Check-Out (CICO) directives.

Infoscience - École polytechnique fédérale de Lausanne

Fine-grain access control for distributed shared memory

Author: Alvin R. Lebeck
Babak Falsafi
Cheriton David R.
Dally William J.
David A. Wood
Falsafi Babak
Ioannis Schoinas
James R. Larus
Nowatzyk A.
Steven K. Reinhardt
Uhlig Richard
Publication venue: 'Association for Computing Machinery (ACM)'

Crossref

Predicting the Effects of Optimization on Parallel Programs

Author: Larus James R.
Publication venue: University of Wisconsin-Madison Department of Computer Sciences
Publication date: 01/01/1990

Minds@University of Wisconsin